Workshop 2 - Poisson regression

Generalized linear models

Linear models are useful when the response (e.g., time) is a continuous quantity, such as 3.1415 or -2.71.

Generalized linear models (GLMs) are used when the response is not continuous.

  • Logistic regression is used when the response is binary (0/1 or True/False).

    • E.g., whether a fire incident leads to a casualty.
  • Poisson regression is used when the response is a non-negative integer, e.g., 0, 12, 1034.

    • E.g., the number of fire incidents in a given region in a week.

Poisson distribution

Suppose \(Y\) is a variable which counts something (e.g., fire incidents).

We might model the distribution of \(Y\) as a Poisson distribution \[Y \sim Poisson(\lambda)\]

The mean (average) value of \(Y\) is given by the free parameter \(\lambda\).

Poisson distribution

Y <- rpois(1000, lambda = 100)
hist(Y)
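As a cross-check on the distribution itself, here is a short Python sketch (illustrative only; the workshop code is in R) that computes the pmf \(P(Y=k) = e^{-\lambda}\lambda^k/k!\) and confirms the probabilities sum to 1 and the mean equals \(\lambda\):

```python
import math

def poisson_pmf(k, lam):
    """P(Y = k) = exp(-lam) * lam^k / k!"""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 5.0
ks = range(60)  # truncate the infinite sum; the tail mass is negligible for lam = 5
total = sum(poisson_pmf(k, lam) for k in ks)
mean = sum(k * poisson_pmf(k, lam) for k in ks)

print(round(total, 6))  # -> 1.0
print(round(mean, 6))   # -> 5.0
```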

Poisson regression

In Poisson regression, we model the mean, \(\lambda\), using a linear combination of our covariates.

Example: Let \(Y_1, \ldots, Y_{365}\) be the number of fire incidents in Scotland on each day in the year 2022.

fire <-  read.csv("https://rich-d-wilkinson.github.io/docs/fire.csv")
head(fire, 10)
   day   weekday events
1    1    Monday    130
2    2   Tuesday    182
3    3 Wednesday    169
4    4  Thursday    170
5    5    Friday    217
6    6  Saturday    293
7    7    Sunday    181
8    8    Monday    151
9    9   Tuesday    138
10  10 Wednesday    185

A Poisson regression model would assume that \[Y_i \sim Poisson(\lambda_i)\]

The challenge is in specifying a suitable model for the means, \(\lambda_1, \ldots, \lambda_{365}\).

We could assume the mean number of incidents each day varies depending on the day of the week.

head(fire)
  day   weekday events
1   1    Monday    130
2   2   Tuesday    182
3   3 Wednesday    169
4   4  Thursday    170
5   5    Friday    217
6   6  Saturday    293

Let \(i=1, \ldots, 365\) indicate the day of the year.

A Poisson regression model assumes that the log of the mean relates linearly to the covariates. So for example, if \[Y_i \sim Poisson(\lambda_i)\] then we might model
\[\log \lambda_i = a_{day(i)}=\begin{cases} a_1& \mbox{ if day i is a Monday}\\ a_2 &\mbox{ if day i is a Tuesday}\\ \vdots\\ a_7 & \mbox{ if day i is a Sunday}\end{cases}\]

This model has 7 free parameters, \(a_1, a_2,\ldots, a_7\).
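Since the model is for \(\log \lambda_i\), each coefficient converts back to a daily mean by exponentiating. A quick Python check (illustrative; 5.29774 is a hypothetical coefficient of the size fitted below):

```python
import math

a_friday = 5.29774               # hypothetical coefficient on the log scale
lam_friday = math.exp(a_friday)  # implied mean number of incidents

print(round(lam_friday))  # roughly 200 incidents per Friday
```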

Fitting Poisson models in R

To fit a Poisson regression model in R, we just have to specify the linear formula for the log of the mean.

So in this case

\[\mathtt{events}\sim \mathtt{weekday-1}\]

The \(\mathtt{-1}\) removes the intercept, so each weekday gets its own coefficient, matching the parameterization above.

pois.model <- glm(events~weekday-1, family=poisson(), data=fire)
pois.model

Call:  glm(formula = events ~ weekday - 1, family = poisson(), data = fire)

Coefficients:
   weekdayFriday     weekdayMonday   weekdaySaturday     weekdaySunday  
           5.298             4.996             5.710             5.275  
 weekdayThursday    weekdayTuesday  weekdayWednesday  
           5.142             5.052             5.146  

Degrees of Freedom: 364 Total (i.e. Null);  357 Residual
Null Deviance:      600500 
Residual Deviance: 395.7    AIC: 2982

This gives us the parameter estimates \(a_1, \ldots, a_7\).

Exercise: What happens if you instead use \(\mathtt{events}\sim \mathtt{weekday}\) as the model formula?

coef(pois.model)
   weekdayFriday    weekdayMonday  weekdaySaturday    weekdaySunday 
        5.297740         4.996432         5.710491         5.275265 
 weekdayThursday   weekdayTuesday weekdayWednesday 
        5.141776         5.052318         5.146488 

What do these numbers mean?

Remember we are modelling the log of the mean \[\log \lambda_i = a_{day(i)}\]

If we look at the log of the mean number of events per day,

library(dplyr)
fire |> group_by(weekday) |> summarise(average= mean(events)) |> mutate(log=log(average))
# A tibble: 7 × 3
  weekday   average   log
  <chr>       <dbl> <dbl>
1 Friday       200.  5.30
2 Monday       148.  5.00
3 Saturday     302.  5.71
4 Sunday       195.  5.28
5 Thursday     171.  5.14
6 Tuesday      156.  5.05
7 Wednesday    172.  5.15

we see this is the same as the parameter estimates.
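This is not a coincidence: for a one-factor Poisson model with no intercept, the maximum-likelihood estimate of each coefficient is the log of that group's sample mean. A toy Python check with made-up counts:

```python
import math

# hypothetical daily counts for one weekday group
mondays = [130, 151, 148, 162]
group_mean = sum(mondays) / len(mondays)

# the fitted coefficient for this group is the log of the group mean
a_hat = math.log(group_mean)

print(round(group_mean, 2), round(a_hat, 4))
```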

Exercise

Consider a model that just distinguishes between weekdays and the weekend:
\[\log \lambda_i = \begin{cases} a &\mbox{ if day i is a week day}\\ b &\mbox{ if day i is a weekend day}\end{cases} \]

  • Fit this model in R.
    • you will first need to create an indicator of weekend vs week days which you can do as follows:
fire2 = fire |> mutate(weekend = weekday %in% c("Saturday", "Sunday"))
  • Check the parameter estimates are the log of the mean number of events on weekdays and weekends.

  • Compare this model to the previous model.

pois.model2 <-  glm(events~weekend-1, family=poisson(), data=fire2)
pois.model2

Call:  glm(formula = events ~ weekend - 1, family = poisson(), data = fire2)

Coefficients:
weekendFALSE   weekendTRUE  
       5.132         5.516  

Degrees of Freedom: 364 Total (i.e. Null);  362 Residual
Null Deviance:      600500 
Residual Deviance: 2066     AIC: 4643
fire2 |> group_by(weekend) |> summarise(average= mean(events)) |> mutate(log=log(average))
# A tibble: 2 × 3
  weekend average   log
  <lgl>     <dbl> <dbl>
1 FALSE      169.  5.13
2 TRUE       249.  5.52

Comparing models

We can compare the models by splitting the data into training and test sets.

  • fit the model to the training data
  • predict the test data
  • compute the mean square prediction error
test.ind <- sample(1:364, 100) # 100 test points
fire.test = fire2 |> slice(test.ind)
fire.train = fire2 |> slice(-test.ind)
model1.train <- glm(events~weekday-1, family=poisson(), data=fire.train)
model2.train <- glm(events~weekend-1, family=poisson(), data=fire.train)

pred1 <- predict(model1.train, newdata=fire.test, type='response')
pred2 <- predict(model2.train, newdata=fire.test, type='response')

#MSE
mean((fire.test$events - pred1)^2)
[1] 242.8843
mean((fire.test$events - pred2)^2)
[1] 1251.816

So the model that assumes a different mean for each day of the week has a much smaller MSE than the model that only distinguishes weekend from week days.

Visualizing fitted models

library(visreg)
visreg(pois.model, xvar='weekday', scale='response')

Visualizing fitted models

library(visreg)
visreg(pois.model, xvar='weekday', scale='linear')

Exposure

Often in Poisson models we have to account for the exposure associated with each observation.

Suppose we have data on the number of fire incidents in a number of different Scottish towns.

  • \(y_1 =\) number of incidents in town A in the year 2020. Town A contains 1,000 properties.
  • \(y_2 =\) number of incidents in town B in the years 2020 and 2021. Town B contains 1,000 properties.
  • \(y_3 =\) number of incidents in town C in the year 2020. Town C contains 10,000 properties.

Everything else being equal, we would expect \[y_1 < y_2 < y_3\]

We model the varying opportunity for incidents using an exposure variable.

Fundamental unit

We model the rate of incidents per some fundamental unit

  • incidents per property per year
  • incidents per 1,000 properties per day
  • etc.

The choice of unit doesn’t change the fundamental model; we just need to be consistent.

If we make the fundamental unit incidents per property per year, then the exposure is the number of years times the number of properties in the town \[Exposure = \#years \times \#properties \]
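Under this unit, the exposures for the three towns described earlier work out as follows (Python, for illustration):

```python
# Exposure = #years observed * #properties
exposure_A = 1 * 1_000    # town A: one year, 1,000 properties
exposure_B = 2 * 1_000    # town B: two years, 1,000 properties
exposure_C = 1 * 10_000   # town C: one year, 10,000 properties

# all else being equal, expected counts scale with exposure: A < B < C
print(exposure_A, exposure_B, exposure_C)  # -> 1000 2000 10000
```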

Our model for the expected number of events per unit, \(y\), corresponding to covariate information \(x\) is

\[E(y|x) = \mbox{Exposure} \times {\rm e}^{a+bx_1+cx_2+\ldots}\]

Poisson regression equations

If we take the logarithm of both sides the model becomes

\[\log E(y|x) = \log(Exposure) + a+bx_1+cx_2+\ldots \]

  • The \(\log(Exposure)\) term is called the offset
    • we need to specify this as an offset, not a covariate, in R:
glm(y ~ offset(log(exposure)) + x1 + x2, family = poisson(link = log))
  • the linear part of the equation is as before \[a+bx_1+cx_2+\ldots\]
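The offset form and the multiplicative form are the same model; taking logs of one gives the other. A small Python sketch verifying this with made-up coefficients \(a\), \(b\) and covariate value:

```python
import math

a, b = -7.5, 0.3        # hypothetical coefficients
x1 = 1.0                # hypothetical covariate value
exposure = 25_000.0     # e.g. #properties * #years

mu_multiplicative = exposure * math.exp(a + b * x1)
mu_offset = math.exp(math.log(exposure) + a + b * x1)

# the two parameterizations give the same expected count
print(abs(mu_multiplicative - mu_offset) < 1e-9)  # -> True
```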

Example

Let’s download a new dataset which has data on the number of fire events in Paisley, St Andrews, and Stirling.

fire2 <-  read.csv("https://rich-d-wilkinson.github.io/docs/fire2.csv")
fire2$weekday <- factor(fire2$weekday, levels=c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'))  # order the days of the week in the usual way
fire2$town <- as.factor(fire2$town)

head(fire2, 10)
    X day   weekday events    town nppty
1   1   1    Monday     10 Paisley 25231
2   2   2   Tuesday      6 Paisley 25231
3   3   3 Wednesday      8 Paisley 25231
4   4   4  Thursday     13 Paisley 25231
5   5   5    Friday     10 Paisley 25231
6   6   6  Saturday     14 Paisley 25231
7   7   7    Sunday     15 Paisley 25231
8   8   8    Monday      9 Paisley 25231
9   9   9   Tuesday      7 Paisley 25231
10 10  10 Wednesday     13 Paisley 25231

Exercise: What could you plot here to understand the data?

EDA

library(ggplot2)
ggplot(fire2, aes(x=weekday, y=events)) + 
    geom_boxplot()

EDA

library(ggplot2)
ggplot(fire2, aes(x=weekday, y=events, fill=town)) + 
    geom_boxplot()

Numerical summaries

fire2 |> group_by(town, weekday) |> summarise(mean(events)) |> print(n=21)
# A tibble: 21 × 3
# Groups:   town [3]
   town       weekday   `mean(events)`
   <fct>      <fct>              <dbl>
 1 Paisley    Monday             10.4 
 2 Paisley    Tuesday            11.6 
 3 Paisley    Wednesday          11.6 
 4 Paisley    Thursday           11.2 
 5 Paisley    Friday             13.8 
 6 Paisley    Saturday           20.9 
 7 Paisley    Sunday             13.9 
 8 St Andrews Monday              5.42
 9 St Andrews Tuesday             5.71
10 St Andrews Wednesday           5.81
11 St Andrews Thursday            6.02
12 St Andrews Friday              6.67
13 St Andrews Saturday           10.3 
14 St Andrews Sunday              6.67
15 Stirling   Monday              7.96
16 Stirling   Tuesday             8.85
17 Stirling   Wednesday           8.71
18 Stirling   Thursday            9.06
19 Stirling   Friday             10.7 
20 Stirling   Saturday           16.5 
21 Stirling   Sunday             10.8 

Model fitting

We can see from the plots that the number of incidents varies by day and town.

The simplest sensible model would assume the town variation can be explained by the differing exposure (number of properties).

fit1<-glm(events~offset(log(nppty))+weekday-1,  family=poisson(link=log), data=fire2)
print(fit1)

Call:  glm(formula = events ~ offset(log(nppty)) + weekday - 1, family = poisson(link = log), 
    data = fire2)

Coefficients:
   weekdayMonday    weekdayTuesday  weekdayWednesday   weekdayThursday  
          -7.534            -7.442            -7.443            -7.435  
   weekdayFriday   weekdaySaturday     weekdaySunday  
          -7.265            -6.842            -7.259  

Degrees of Freedom: 1092 Total (i.e. Null);  1085 Residual
Null Deviance:      32280000 
Residual Deviance: 1987     AIC: 6419

To make predictions, we need to specify the exposure (nppty):

predict(fit1, newdata=list(weekday='Saturday', nppty=25231))
       1 
3.294191 
predict(fit1, newdata=list(weekday='Monday', nppty=25231), type='response')
       1 
13.48324 
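The first call returns the prediction on the link (log) scale; type='response' exponentiates it to give a mean count. A Python check (illustrative) using the printed Monday coefficient (-7.534, rounded) and the Paisley exposure:

```python
import math

coef_monday = -7.534      # fitted coefficient, as printed above (rounded)
nppty = 25231             # exposure: number of properties in Paisley

link_pred = math.log(nppty) + coef_monday  # log-scale (link) prediction
response_pred = math.exp(link_pred)        # expected count per day

print(round(response_pred, 2))  # close to the 13.48 from predict(..., type='response')
```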

Model fitting

A more complex model would be to assume the three towns have different rates even after accounting for exposure:

fit2<-glm(events~offset(log(nppty))+weekday*town, family=poisson(link=log), data=fire2)
fit2

Call:  glm(formula = events ~ offset(log(nppty)) + weekday * town, family = poisson(link = log), 
    data = fire2)

Coefficients:
                    (Intercept)                   weekdayTuesday  
                      -7.789963                         0.103148  
               weekdayWednesday                  weekdayThursday  
                       0.103148                         0.074503  
                  weekdayFriday                  weekdaySaturday  
                       0.279360                         0.692226  
                  weekdaySunday                   townSt Andrews  
                       0.287682                         0.768195  
                   townStirling    weekdayTuesday:townSt Andrews  
                       0.370722                        -0.051323  
weekdayWednesday:townSt Andrews   weekdayThursday:townSt Andrews  
                      -0.034628                         0.029794  
   weekdayFriday:townSt Andrews   weekdaySaturday:townSt Andrews  
                      -0.071943                        -0.051866  
   weekdaySunday:townSt Andrews      weekdayTuesday:townStirling  
                      -0.080264                         0.002212  
  weekdayWednesday:townStirling     weekdayThursday:townStirling  
                      -0.013122                         0.054490  
     weekdayFriday:townStirling     weekdaySaturday:townStirling  
                       0.017339                         0.035346  
     weekdaySunday:townStirling  
                       0.014389  

Degrees of Freedom: 1091 Total (i.e. Null);  1071 Residual
Null Deviance:      2598 
Residual Deviance: 1110     AIC: 5570

This model includes interaction terms, i.e., each combination of town and day is given a different mean.

Visualizing model fit

library(visreg)
visreg(fit2, by='town', xvar='weekday')

Exercise

Which model is best?

test.ind <- sample(1:1092, 300) # 300 test points
fire.test = fire2 |> slice(test.ind)
fire.train = fire2 |> slice(-test.ind)
model1.train <- glm(events~offset(log(nppty))+weekday, family=poisson(link=log), data=fire.train)
model2.train <- glm(events~offset(log(nppty))+weekday*town, family=poisson(link=log), data=fire.train)

pred1 <- predict(model1.train, newdata=fire.test, type='response')
pred2 <- predict(model2.train, newdata=fire.test, type='response')

#MSE
mean((fire.test$events - pred1)^2)
[1] 16.56331
mean((fire.test$events - pred2)^2)
[1] 10.09096

So model 2 (with interactions) has a much smaller MSE than model 1.

Poisson regression in the CRIM

In the CRIM, we model the number of incidents per day per property.

  • the exposure is the number of days observed times the number of properties in the data zone:

\[Exposure = \#days \times \#properties \]

  • Let \(y_{ijk}\) be the number of incidents of risk level \(k\) recorded in data zone \(i\) and ACORN category \(j\).

  • Let \(x_i\) be the standardised SIMD rank of data zone \(i\).

We considered models of the type

\[\log E(y_{ijk}|x_{i}) = \log(Exposure) + \mu + \alpha_j + \beta_k + \gamma_{jk} + \delta_{jk} x_i\]

which we can fit using commands such as

glm(formula = n_incidents ~ offset(log(n_properties * n_dates)) +
      ACORN_CAT * Risk_Level * SIMD_Rank_standardised,
    family = poisson(link = log), data = dat_Pois2)

Expanding the model

  • Do we need all the interactions?

  • Do we need an effect per datazone? \[\log E(y_{ijk}|x_{i}) = \log(Exposure) + \mu +a_i+ \alpha_j + \beta_k + \gamma_{jk} + \delta_{jk} x_i\] This requires random effects models.

  • Is there value in adding new covariates?

glm(formula = n_incidents ~ offset(log(n_properties * n_dates)) + NEW_COVARIATE +
      ACORN_CAT * Risk_Level + SIMD_Rank_standardised,
    family = poisson(link = log), data = dat_Pois2)

With all model development, it is important to remember the modelling pipeline:

  1. Propose model
  2. Fit model to data
  3. Test model predictions + visualize model

flowchart LR
  A[Plot data] --> C{Propose model}
  C --> D[Fit model to data]
  D --> E[Test + visualise model ]
  E --> C